Application query conversion

ABSTRACT

A set of potential search-query terms can be identified based on empirical queries for apps. For each potential search-query term, a subset of documents within a set of documents can be identified based on apps that users were likely to click on or download following entry of a search query with a comparable or same term. One or more other indicator terms can be identified as being related to the potential search-query term based on the one or more other indicator terms being prevalent within the subset of documents. Upon receipt of a subsequent search query, a search can then be performed using both a term within the search query and one or more related other indicator terms.

BACKGROUND

This application is a divisional of co-pending U.S. application Ser. No. 13/599,722 filed on Aug. 30, 2012.

The present disclosure relates generally to receiving a query for software applications and identifying one or more terms related to, but not included within, the query to use while searching a database. The related terms can be identified based on user interactions following similar or same queries.

In recent years, application software or “apps” have become increasingly popular. As professional and amateur developers develop apps at an impressive rate, a great variety of useful and/or amusing apps are available to users. However, the large number of apps also carries a substantial disadvantage, in that it can make it more difficult for a user to search for an app with a specific utility or function.

Users are limited in that they can only enter a finite number of words when entering a search query. Thus, even though they may wish to receive an app that relates to a complex concept, they must identify a relatively small number of search-query words in an attempt to briefly capture the concept. Many apps may have a title or description including one or more query terms, yet frequently most of these apps do not pertain to the concept that the user had in mind. Returning all apps with terms from the query may thus frustrate the user and waste the user's time.

Accordingly, it is desirable to provide systems and methods for performing searches extending beyond a purely textual search tied to search-query terms, such that more relevant apps can be identified and presented to a user.

SUMMARY

Embodiments described herein can efficiently and effectively respond to users' search queries. When a user searches for an application (“app”), the user can enter one or more search-query terms. The user can select the query terms in an attempt to capture a concept related to a function of a desired app. Textual searches tied merely to the query terms can produce undesirable results, such as returning largely irrelevant apps or ranking the most suitable apps below others. Thus, embodiments of the subject application can identify one or more other terms related to the search-query terms and perform the app search or a ranking of identified apps using the other indicator terms.

The indicator terms can be identified by analyzing words in a set or subset of training documents associated with apps. In various embodiments, the set of training documents can include all available apps, a sampling of available apps, all apps available at a past time point, or a sampling of apps available at a past time point. Either the entire set or a subset of the training documents can be used to identify the indicator terms.

For example, a subset of documents can include documents responsive to a textual query using the search-query terms. As another example, a subset of documents can include documents which users interacted with (e.g., selected for more information or downloaded) following a query using comparable or same query terms. Related indicator terms can be defined as those frequently appearing throughout the set of documents. For example, related indicator terms may be those appearing at least once in a large fraction of the documents or those appearing more than a threshold number of times across the set.

In some instances, the entire set of documents (e.g., documents associated with all apps available to users) is analyzed in order to identify a relationship structure associated with a plurality of term pairs. Each pair can be assessed to estimate whether a given query term is related to a given indicator term. Terms in the relationship structure can be identified, e.g., as all noun and verb terms within the set of documents or terms appearing within the set more than a threshold number of times. A pairwise analysis between each identified term can be performed to quantify, e.g., the probability that one term of the pair will appear within a document given that the other term of the pair appears. When a user inputs a search query including query terms, related indicator terms can be identified based on terms associated with high relatedness metrics (e.g., based on frequent co-occurrence) with respect to one or more query terms.

By considering related indicator terms, search quality can be improved. Apps that have high term counts for a particular search-query term but include few or no related indicator terms can be eliminated from search results or assigned a low rank. Meanwhile, apps that have low term counts for a particular search-query term but include many related indicator terms can be returned in search results and/or highly ranked.

According to one embodiment, a method can be provided of responding to a search query requesting relevant software applications from a database storing software applications and a set of documents. Each document can be associated with a software application. A subset of documents within the set of documents based on empirical user actions can be identified with a server. The empirical user actions can involve selecting a software application in the subset in response to receiving results for queries including a first query term. A second indicator term that occurs within the subset of documents can be identified. A data structure can be updated with the server to associate the first query term with the second indicator term. The search query can be received at the server from an electronic device of a user. The search query can include the first query term. The data structure can be accessed with the server to identify the second indicator term associated with the first query term. Documents within the database can be searched with the server to identify relevant software applications based on the first query term and the second indicator term.

According to another embodiment, a computer product can be provided that includes a non-transitory computer readable medium storing a plurality of instructions that when executed control a computer system to respond to a search query requesting relevant software applications from a database storing software applications and a set of documents. Each document can be associated with a software application. The instructions can include identifying, with a server, a subset of documents within the set of documents based on empirical user actions. The empirical user actions can involve selecting a software application in the subset in response to receiving results for queries including a first query term. The instructions can also include identifying a second indicator term that occurs within the subset of documents and updating, with the server, a data structure to associate the first query term with the second indicator term. The instructions can further include receiving, at the server, the search query from an electronic device of a user, the search query including the first query term, and accessing, with the server, the data structure to identify the second indicator term associated with the first query term. The instructions can still further include searching, with the server, documents within the database to identify relevant software applications based on the first query term and the second indicator term.

According to yet another embodiment, a method can be provided of defining a data structure to use for expanding a search query requesting relevant software applications from a database storing software applications and a set of documents, each document being associated with a software application. A subset of documents within the set of documents can be identified with a server based on empirical user actions. The empirical user actions can involve selecting a software application in the subset in response to receiving results for queries including a first query term. A set of second indicator terms can be identified. For each second indicator term, the server can determine how frequently the second indicator term occurs within the subset of documents and can define a relatedness metric between the first query term and the second indicator term based on the determined occurrence frequency. The data structure can be constructed that indicates the relatedness metric between the first query term and each second indicator term in the set of second indicator terms.

According to still another embodiment, a method of defining a data structure to use for expanding a search query requesting relevant software applications from a database storing software applications and a set of documents can be provided. Each document can be associated with a software application. A set of N potential query terms can be identified with a server by analyzing words in the set of documents, N being an integer greater than 100. A co-occurrence matrix can be defined as an N×N array of matrix elements. Each row can correspond to a respective one of the N potential query terms and each column can correspond to a respective one of the N potential query terms. For each potential query term I, a subset of documents within the set of documents can be identified with the server in which the potential query term I exists. Further, for each of the N−1 other potential query terms J, the subset of documents can be analyzed with the server to identify a respective number of times the respective other potential query term J occurs in the subset of documents, and a co-occurrence score corresponding to a matrix element can be identified with the server based on the respective number. Thus, the co-occurrence matrix can be constructed with the server using the co-occurrence scores.

Other embodiments are directed to systems and computer readable media associated with methods described herein.

These and other embodiments of the invention along with many of its advantages and features are described in more detail in conjunction with the text below and attached figures.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a system for performing searches for apps.

FIG. 2 is a simplified block diagram of an implementation of a device configured to receive user search queries according to an embodiment of the present invention.

FIG. 3 is a simplified block diagram of an implementation of a remote server for processing app searches according to an embodiment of the present invention.

FIG. 4 is a flow diagram of a process for determining indicator terms related to query terms according to an embodiment of the present invention.

FIGS. 5A-5C are examples of relationship structures.

FIG. 6 is a flow diagram of a process for using a relationship structure to respond to a query.

FIG. 7 is a flow diagram of a process for determining indicator terms related to query terms according to an embodiment of the present invention.

FIG. 8 is a flow diagram of a process for determining indicator terms related to query terms according to an embodiment of the present invention.

FIG. 9 is a flow diagram of a process for performing a search for a digital file according to an embodiment of the present invention.

DETAILED DESCRIPTION

Certain embodiments of the present invention can expand or alter terms used while searching through documents in an app database. The app database can include subdatabases, such as a first subdatabase storing apps and a second subdatabase storing documents associated with the apps. A set of potential query terms can be identified based on frequently entered empirical query terms or based on terms (e.g., nouns and verbs) appearing in a set of app-related documents. For each potential query term, a subset of documents relevant to the term can be identified. For example, the subset can include all documents that include the potential query term. As another example, the subset can include documents which users previously interacted with following a comparable or same query.

Using the subset, a data structure can be constructed. The data structure can include associations between each potential query term and one or more other terms. In some instances, the data structure includes a list of related indicator terms. In some instances, the data structure includes a list of related indicator terms, each related indicator term being associated with a score (e.g., the score depending on how frequently the related indicator term co-occurred with the potential query term in the subset). In the latter instance, the data structure can include a vector (e.g., associating one potential query term with multiple related indicator terms) or a matrix (e.g., associating each of a number of potential query terms with multiple other potential related indicator terms).

Subsequently, when a search query is received, related indicator terms can be identified based on the data structure. A database can be searched using the search-query terms, and, in some instances, using the related indicator terms. The database can include, e.g., the set of documents or a larger set. The search results can be ranked using a ranking algorithm, which can depend on the related indicator terms (e.g., ranking results higher if they include many related indicator terms). The ranked search results can then be presented to a user.

I. Introduction

FIG. 1 illustrates a system 100 for performing searches for apps. The searches can be performed at device 105. Device 105 can execute software, such as operating software or program software, that enables a user to perform an app search and receive results of the search (e.g., as a list). Device 105 can include mobile devices, which can include any device likely to be carried on a person of a user and capable of performing app searches as described herein. Device 105 can include an electronic device, mobile phone, smartphone, tablet computer, laptop computer, or desktop computer.

Using a user-input component, a user can input a search query into device 105. For example, a user can insert search-query terms using a touchscreen keypad, non-touchscreen keypad, touchscreen buttons, non-touchscreen buttons, or a mouse. Upon entry of the search query, the query can appear within a query box 115. In this illustration, a user of one mobile device 105 entered search-query text of “email.”

The search query can be transmitted (e.g., wirelessly transmitted) from device 105 to a remote server 150. Remote server 150 can use query terms of the search queries to search for apps associated with data in an app database 160. App database 160 can include a dynamic database, in which data related to new apps are regularly or continuously added to the database. At least some apps can be contributed by third-party app developers.

App database 160 can include titles, descriptions, metadata, download frequency or count, user ranking, and/or the apps themselves. Some or all of this information can be included in an app document 170. FIG. 1 shows an example of two hypothetical app documents 170 a and 170 b. Each app document 170 pertains to a different app. App document 170 a is associated with a “Best Mail” app, and app document 170 b is associated with a “Pool Game” app. The app document includes a description. The description can be included, e.g., as metadata associated with the app. The description can be presented to users, such that users can decide whether to download the app.

The depicted example illustrates the problem with using a pure textual-based search technique based on search-query terms. In this instance, the Mail app document 170 a includes the term “email” once in its title and description. Meanwhile, the pool game app document 170 b includes the term “email” three times in its description. While the pure textual-based query search would therefore rank the pool game above the Mail app, it is likely that a user entering the “email” query would prefer the reverse ranking.

As further described below, remote server 150 can identify other indicator terms related to the search-query term of “email”. Thus, e.g., “account,” “reply,” “message,” “filter,” and “compose” can all be identified as related indicator terms. A search or ranking sensitive to these related indicator terms can therefore identify the Mail app document 170 a as being more relevant to the “email” search as compared to the pool game app document 170 b. Remote server 150 can transmit (e.g., wirelessly transmit) search results (e.g., including information in documents 170) to device 105. In this instance, the search results can include data associated with the Mail app and not with the pool game app, or the search results can include data associated with both apps, though the pool game app can be ranked below the Mail app. Device 105 can present (e.g., display) some or all of the respective results to a user (e.g., in a list). A user can select an app of the displayed apps and further request a download of the app or purchase the app.
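The ranking reversal described above can be illustrated with a minimal sketch. The related terms come from the example in the text; the two descriptions, the function names, and the scoring rule (related-term coverage first, raw query-term count as a tie-breaker) are assumptions made only for illustration and are not taken from FIG. 1.

```python
import re

# Related indicator terms for "email", as listed in the example above.
RELATED = {"email": ["account", "reply", "message", "filter", "compose"]}

# Hypothetical descriptions; the actual text of app documents 170 a and 170 b
# is not reproduced in this section.
DOCS = {
    "Best Mail": "Compose a message, reply fast, filter spam, and manage every email account.",
    "Pool Game": "Play pool. Email friends your score, email a challenge, or email a replay.",
}

def tokens(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def score(doc_text, query_term):
    """Return (distinct related terms present, occurrences of the query term)."""
    words = tokens(doc_text)
    query_hits = words.count(query_term)
    related_hits = sum(1 for t in RELATED.get(query_term, []) if t in words)
    # Assumed scoring: related-term coverage dominates, query count breaks ties.
    return (related_hits, query_hits)

def rank(query_term):
    """Order app names from most to least relevant under the assumed scoring."""
    return sorted(DOCS, key=lambda name: score(DOCS[name], query_term), reverse=True)

print(rank("email"))
```

Under this assumed scoring the pool game still matches “email” three times, but it contains none of the related indicator terms and therefore falls below the Mail app.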

II. Device and Server

FIG. 2 is a simplified block diagram of an implementation of device 105 configured to receive user search queries according to an embodiment of the present invention. Device 105 includes a processing subsystem 202, a storage subsystem 204, a user input device 206, a user output device 208, and a network interface 210.

Processing subsystem 202, which can be implemented as one or more integrated circuits (e.g., one or more single-core or multi-core microprocessors or microcontrollers), can control the operation of device 105. In various embodiments, processing subsystem 202 can execute a variety of programs in response to program code and can maintain multiple concurrently executing programs or processes. At any given time, some or all of the program code to be executed can be resident in processing subsystem 202 and/or in storage subsystem 204.

Through suitable programming, processing subsystem 202 can provide various functionality for device 105. For example, processing subsystem 202 can execute software (e.g., operating software) to allow a user to input search-query terms via user input device 206, to transmit search-query terms to remote server 150 via network interface 210, to view search results via user output device 208, and/or to interact with search results via user input device 206 and user output device 208 (e.g., to select, download or purchase a search-result app). Processing subsystem 202 can further execute one or more apps identified in response to search queries, downloaded, and stored in a local app database 260.

Storage subsystem 204 can be implemented, e.g., using disk, flash memory, or any other storage media in any combination, and can include volatile and/or non-volatile storage as desired. In some embodiments, storage subsystem 204 can store one or more apps, stored in local app database 260, to be executed by processing subsystem 202. These apps can include apps downloaded by a user (e.g., via network interface 210) and apps identified based on search-query results. Programs and/or data can be stored in non-volatile storage and copied in whole or in part to volatile working memory during program execution.

A user interface can be provided by one or more user input devices 206 and one or more user output devices 208. User input devices 206 can include a touch pad, touch screen, scroll wheel, click wheel, dial, button, switch, keypad, microphone, or the like. User output devices 208 can include a video screen, indicator lights, speakers, headphone jacks, or the like, together with supporting electronics (e.g., digital-to-analog or analog-to-digital converters, signal processors, or the like). A user can operate input devices 206 to invoke the functionality of device 105 and can view and/or hear output from device 105 via output devices 208.

Network interface 210 can provide voice and/or data communication capability for device 105. For example, network interface 210 can provide device 105 with the capability of communicating with remote server 150. In some embodiments network interface 210 can include radio frequency (RF) transceiver components for accessing wireless voice and/or data networks (e.g., using cellular telephone technology, advanced data network technology such as 3G, 4G or EDGE, WiFi (IEEE 802.11 family standards), or other mobile communication technologies, or any combination thereof), and/or other components. In some embodiments network interface 210 can provide wired network connectivity (e.g., Ethernet) in addition to or instead of a wireless interface. Network interface 210 can be implemented using a combination of hardware (e.g., antennas, modulators/demodulators, encoders/decoders, and other analog and/or digital signal processing circuits) and software components.

FIG. 3 is a simplified block diagram of an implementation of remote server 150 for processing app searches according to an embodiment of the present invention. Remote server 150 includes a processing subsystem 302, storage subsystem 304, a user input device 306, a user output device 308, and a network interface 310. Storage subsystem 304, user input device 306, user output device 308 and network interface 310 can have similar or identical features as storage subsystem 204, user input device 206, user output device 208 and network interface 210 of device 105 described above.

Processing subsystem 302, which can be implemented as one or more integrated circuits (e.g., a conventional microprocessor or microcontroller), can control the operation of remote server 150. In various embodiments, processing subsystem 302 can execute a variety of programs in response to program code and can maintain multiple concurrently executing programs or processes. At any given time, some or all of the program code to be executed can be resident in processing subsystem 302 and/or in storage subsystem 304.

Through suitable programming, processing subsystem 302 can provide various functionality for remote server 150. Thus, remote server 150 can process search queries input at device 105 in order to identify search results. In some instances, processing subsystem 302 identifies the search results using one or more related indicator terms identified by the processing subsystem 302 at a previous time (e.g., during a training interval) or in real-time in response to a search query. Processing subsystem 302 can identify the related indicator terms by identifying a subset of documents from a set of documents (e.g., a training set or all documents available at a particular time point) stored within app database 160. The subset of documents can include those with an actual or potential search term or documents which users interacted with subsequent to receiving results for a comparable or same search term. The related indicator terms can be determined in advance of processing a search query (e.g., during a training interval) or in real-time in response to a search query. It will thus be understood that disclosures herein that refer to a potential search-query term or the like can also apply to an actual search-query term and the converse.

After the subset of documents is identified, processing subsystem 302 can identify the related indicator terms by determining which other terms occur within the subset of documents. In some instances, the subset of documents includes one or more search-query terms. Thus, determining other terms that occur within the subset of documents can amount to determining other terms that co-occur with the one or more search-query terms. Determination of the other terms can include, e.g., counting a number of documents within the subset that include a specific term or counting a total number of occurrences of a specific term within the document subset. The count can be normalized, e.g., based on a number of documents within the subset or a number of words (generally or of a particular type) in the subset. In some instances, the other related indicator terms are constrained to be of a particular type. For example, the constraint may indicate that the other related indicator terms must be a noun, must be a verb, or must be a noun or verb. The constraint may indicate that the other related indicator terms cannot be, e.g., an article, a preposition, an adverb or an adjective.
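The two counting options and their normalizations can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the returned field names are hypothetical, and a small exclusion set stands in for the part-of-speech constraint, since real part-of-speech tagging is outside the scope of the sketch.

```python
from collections import Counter
import re

def indicator_counts(subset, query_term,
                     excluded=frozenset({"a", "the", "to", "and", "of"})):
    """Count candidate indicator terms in a document subset.

    `excluded` is an assumed stand-in for a type constraint (e.g., no articles
    or prepositions); it is not a real part-of-speech filter.
    """
    doc_freq = Counter()    # number of documents that contain each term
    total_freq = Counter()  # total occurrences of each term across the subset
    total_words = 0         # words of the permitted type in the subset
    for text in subset:
        words = [w for w in re.findall(r"[a-z]+", text.lower())
                 if w != query_term and w not in excluded]
        total_words += len(words)
        total_freq.update(words)
        doc_freq.update(set(words))
    n_docs = len(subset)
    return {
        term: {
            # Document count normalized by the number of documents.
            "doc_fraction": doc_freq[term] / n_docs,
            # Occurrence count normalized by the number of permitted words.
            "word_fraction": total_freq[term] / total_words,
        }
        for term in total_freq
    }

subset = [
    "reply to the email message",
    "compose a message and reply",
    "filter the message",
]
counts = indicator_counts(subset, "email")
```

Either normalized value could then be compared against a threshold to decide which terms count as related indicator terms; the choice of threshold is left open here, as it is in the text.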

In some instances, the determination of related indicator terms can depend on empirical data, such as empirical data identifying users' interactions with search results. For example, the subset of documents can consist of or include those associated with search results that a user clicked on or selected for download following a search query including or consisting of a comparable or same search-query term. As another example, a document within a subset can be weighted according to a frequency with which a user clicked on or selected for download the associated search result (e.g., across all searches or across search queries with a comparable or same search-query term), such that other words in documents associated with frequently selected or downloaded apps are assigned relatively high weights. Thus, processing subsystem 302 can collect the empirical data and store it in an empirical query database 316. The empirical data can include search-query terms, search results responsive to the search query, and/or user interactions (e.g., clicked on results, view times, or downloaded apps) subsequent to receipt of the search results. Empirical query database 316 can be updated to remove or reduce the weight of older data and to add and/or weight newer data.
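The document weighting described above can be sketched as follows. The log schema of (query term, app identifier, action) tuples, the function name, and the choice to weight each document by its share of selections are all assumptions for illustration; the text does not specify how empirical query database 316 stores its records.

```python
from collections import Counter, defaultdict
import re

def weighted_indicator_terms(log, documents, query_term):
    """Weight candidate terms by how often users selected each document.

    log: iterable of (query_term, app_id, action) tuples (hypothetical schema).
    documents: mapping from app_id to the text of its app document.
    """
    selections = Counter(
        app for q, app, action in log
        if q == query_term and action in ("click", "download")
    )
    total = sum(selections.values())
    if total == 0:
        return {}
    weights = defaultdict(float)
    for app, count in selections.items():
        doc_weight = count / total  # share of selections for this document
        for term in set(re.findall(r"[a-z]+", documents[app].lower())):
            if term != query_term:
                weights[term] += doc_weight
    return dict(weights)

documents = {
    "mail": "compose and reply to email",
    "pool": "email your pool score",
}
log = [
    ("email", "mail", "click"),
    ("email", "mail", "download"),
    ("email", "mail", "click"),
    ("email", "pool", "click"),
    ("game", "pool", "download"),  # different query, ignored for "email"
]
weights = weighted_indicator_terms(log, documents, "email")
```

Terms from the frequently selected document receive a weight of 0.75 in this example, while terms from the rarely selected one receive 0.25. Aging of older records, which the text also mentions, is not modeled.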

Processing subsystem 302 can store some or all term relationships in a data structure, such as a term relationship structure 318. Term relationship structure 318 can associate specific search-query terms with one or more other terms (e.g., potential related indicator terms). In some instances, the mere association of a search-query term with one or more other terms indicates that the other terms are indicator terms (e.g., indicative of an intent of a query) and/or related to the search-query term. For example, the associated other terms can include only those that satisfy a relatedness criterion, such as having a relatedness metric exceeding a threshold. In some instances, the other terms are assigned a weight to identify a degree to which they are related to the search-query term.

Term relationship structure 318 can include an array, such as a two-dimensional array. For example, the term relationship structure can include a list of terms related to each of a set of potential query terms. The list can be of a same or different length across the set of potential query terms. In some instances, term relationship structure 318 indicates a numeric relationship between each of a set of potential query terms and one or more second terms (e.g., each of a set of second terms). Term relationship structure 318 can include a sparse array, such that numeric relationships of “0” are not stored.
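One way to picture a sparse structure of this kind is a dictionary of dictionaries in which absent entries read as zero. The class below is a minimal sketch under that assumption; the class name, method names, and threshold parameter are hypothetical and do not describe how term relationship structure 318 is actually implemented.

```python
class TermRelationshipStructure:
    """Sparse map from query term to {indicator term: relatedness metric}."""

    def __init__(self, threshold=0.0):
        # Metrics at or below the threshold are treated as unrelated.
        self.threshold = threshold
        self._rows = {}

    def set(self, query_term, indicator_term, metric):
        """Store a metric, or drop the entry if it does not exceed the threshold."""
        if metric <= self.threshold:
            self._rows.get(query_term, {}).pop(indicator_term, None)
            return
        self._rows.setdefault(query_term, {})[indicator_term] = metric

    def get(self, query_term, indicator_term):
        """Return the stored metric, or 0.0 for entries that were never stored."""
        return self._rows.get(query_term, {}).get(indicator_term, 0.0)

    def related(self, query_term, limit=None):
        """Return (indicator term, metric) pairs, most related first."""
        row = self._rows.get(query_term, {})
        ranked = sorted(row.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:limit]

structure = TermRelationshipStructure()
structure.set("email", "reply", 0.8)
structure.set("email", "pool", 0.0)   # zero relationship, not stored
structure.set("email", "message", 0.9)
```

Because each query term keeps its own row, the lists can have different lengths across query terms, which matches the description above.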

Search-query terms and/or other terms (e.g., potential or actual related indicator terms) can include a single word, a set of words or a phrase. For example, in some instances, processing subsystem 302 detects individual words within a search query and identifies, for each word, other terms (e.g., individual words, groups of words such as bigrams or trigrams, or both individual words and groups of words) related to the individual words. In some instances, processing subsystem 302 detects a group of words (e.g., bigrams or trigrams) and identifies other terms related to the group of words. In some instances, processing subsystem 302 identifies other terms related to all words in a query and/or using a combination of techniques (e.g., identifying words related to each word, with each bigram, with each trigram, etc.). The related indicator terms and/or term relationship structure 318 can be updated periodically or continuously based on, e.g., new user input or new empirical data.

Using the related indicator terms information, processing subsystem 302 can then perform a particular search query. Specifically, after receiving a search query, processing subsystem 302 can look up one or more indicator terms in term relationship structure 318 related to one or more terms in the search query. These indicator terms can be used while searching through a set of documents stored in app database 160. The set of documents used for real-time searches can be the same or can differ from the set of documents used to determine indicator terms. For example, the set of documents used to determine related indicator terms can include a sampling of documents or documents available at a past time point, whereas a set of documents used during a real-time search can include all available documents.

Processing subsystem 302 can expand a search query to include one or more related indicator terms. In some instances, a series of searches are performed. As one example, a first search may use the actual search-query term(s). A second search can search within results of the first search using most related indicator terms. A third search can search within results of the second search using next-most related indicator terms. As another example, a single search may use the actual search-query term(s) and related indicator terms. The related indicator terms may be weighted while performing the search.
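The series of searches can be sketched as successive filters over the previous result set. The function names and sample documents are hypothetical, and one policy in the sketch is an assumption rather than something the text states: when a tier of related terms would eliminate every remaining result, the sketch stops and keeps the last non-empty result set.

```python
def contains_any(text, terms):
    """Return True if any of the terms appears as a whole word in the text."""
    words = set(text.lower().split())
    return any(term in words for term in terms)

def cascaded_search(documents, query_terms, related_tiers):
    """Search with the query terms, then narrow with each tier of related terms.

    related_tiers: lists of indicator terms ordered from most related to
    next-most related, and so on.
    """
    results = [name for name, text in documents.items()
               if contains_any(text, query_terms)]
    for tier in related_tiers:
        narrowed = [name for name in results
                    if contains_any(documents[name], tier)]
        if not narrowed:  # assumed policy: keep the last non-empty result set
            break
        results = narrowed
    return results

documents = {
    "Best Mail": "compose reply filter email account",
    "Pool Game": "email your pool score",
    "Notes": "compose a note and email it",
}
hits = cascaded_search(documents, ["email"],
                       [["compose", "reply"], ["filter", "account"]])
```

In this example the first search returns all three documents, the second keeps the two that mention “compose” or “reply”, and the third keeps only the Mail app document.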

The search can also be affected by non-textual properties, such as a user popularity of an app, an app price, or an app release date. Results from the search can include a list of documents (and/or associated apps or data) and can include a ranking of the documents.

Results from the search can be transmitted from remote server 150 to device 105 in one or more transmissions. Device 105 can present the search results to the user via the user output device 208. For example, for each search result, one or more of the following information types can be presented for the app: its name, publisher, price, category, brief description, number of downloads, average rating, device compatibility, size, version, update date, languages, screen shots, and/or user reviews. In some instances, some types of information are initially presented (e.g., in a list or grid presentation of the search results), and other types of information are available upon a user selecting a search result. For example, an initial list display can identify a “Mail” app and note its price, publisher and category. A user can then select (e.g., tap or click) on the app representation, and a brief description, number of downloads, average rating, device compatibility, size, version, update date, languages, screen shots and/or user reviews can be further presented for the app.

A user can choose to download one of the apps presented in the search results by interacting with the user input device 206. Device 105 can transmit a request for the app to remote server 150 via network interface 210. Device 105 can additionally (in a same or different transmission) or alternatively send other types of information, such as which apps the user selected to view an extended information profile. Remote server 150 can respond to the app request by transmitting the requested app to device 105 via network interface 310. In some instances, device 105 or remote server 150 can require, e.g., payment information (e.g., a credit-card number and expiration date, financial-system login information, and/or payment authorization) from a user prior to transmitting a request for the app to remote server 150, transmitting the app to device 105, or providing the app to the user. Remote server 150 can update its empirical query database 316 based on, e.g., the download request or other information that was provided (e.g., which apps were viewed).

It will be appreciated that device 105 and remote server 150 described herein are illustrative and that variations and modifications are possible. A device 105 can be implemented as a mobile electronic device and can have other capabilities not specifically described herein (e.g., telephonic capabilities, power management, accessory connectivity, etc.). In a system with multiple devices 105 and/or multiple remote servers 150, different devices 105 and/or remote servers 150 can have different sets of capabilities; the various devices 105 and/or remote servers 150 can be but need not be similar or identical to each other.

Further, while device 105 and remote server 150 are described with reference to particular blocks, it is to be understood that these blocks are defined for convenience of description and are not intended to imply a particular physical arrangement of component parts. Further, the blocks need not correspond to physically distinct components. Blocks can be configured to perform various operations, e.g., by programming a processor or providing appropriate control circuitry, and various blocks might or might not be reconfigurable depending on how the initial configuration is obtained. Embodiments of the present invention can be realized in a variety of apparatus including electronic devices implemented using any combination of circuitry and software.

Additionally, while device 105 and remote server 150 are described as singular entities, it is to be understood that each can include multiple coupled entities. For example, remote server 150 can include a server, a set of coupled servers, a computer and/or a set of coupled computers.

III. Using Full Table of Terms

FIG. 4 is a flow diagram of a process 400 for determining indicator terms related to query terms according to an embodiment of the present invention. Process 400 can be implemented, e.g., in remote server 150 of FIG. 3. Process 400 can be performed to create a co-occurrence matrix of all the words (or a subset of substantive words, e.g., nouns and verbs) used in summaries (or other metadata) of apps in a database. The co-occurrence matrix can then be used to augment a search using a query term.

At block 402, a set of documents can be accessed (e.g., from app database 160). Each document within the set can be associated with a digital file, such as an app. The document can include textual data, such as a title or description of the digital file. The document can further indicate, e.g., the digital file's popularity, rating, availability date, number of times that users downloaded the digital file, number of times that users requested more information about the digital file, or a publisher.

The set of documents can include a real-time set or a stored set. For example, the set of documents can include some or all apps currently available to some or all users, or the set of documents can include some or all apps available to some or all users at a past time point. In some instances, the set of documents includes a sampling of available documents (e.g., a randomly selected or pseudo-randomly selected tenth of available documents). In some instances, the set of documents includes a training set, which can include documents associated with actual files or which can include documents associated with fictitious digital files.

At block 404, a set of potential query terms can be identified. In some instances, the potential query terms can include query terms frequently used by users during queries. For example, the potential query terms can be identified by consulting empirical query data 316 (e.g., query logs) and identifying terms which were entered by users more than an absolute or relative number of times (e.g., to identify the 5,000 terms most frequently entered by users). In some instances, the potential query terms include some or all terms within the set of documents. For example, the potential query terms can include all nouns and verbs in the set of documents, all nouns and verbs occurring more than an absolute or relative number of times in the set of documents, or all nouns and verbs occurring at least once in a threshold number of documents. The potential query terms can be constrained according to one or more threshold numbers of words. For example, a criterion can indicate that all potential query terms are to be of a specific word length (e.g., one word, two words or three words).

Blocks 406-410 can be repeated for each identified potential query term. At block 406, a subset of documents can be identified. In some instances, the subset of documents is identified based on users' interactions with documents. For example, the subset of documents can include those associated with digital files (e.g., apps) for which users were likely to request additional information, click on or download. The subset of documents can be tied to interactions following comparable or same queries. As a specific illustrative example, if an identified query of interest is “pool game”, the subset of documents can include the app descriptions for the 25 apps observed in the logs to be most downloaded by users subsequent to issuing that same search query.

In some instances, the subset of documents can include documents within the set of documents that include the identified potential query term, that include the identified potential query term in a specified field (e.g., “title” or “description”), or that include the identified potential query term more than a threshold number of times.

At block 408, one or more related indicator terms can be identified. The related indicator terms can be identified using a relatedness metric. For a given term, the relatedness metric can indicate and/or can be influenced by a number of times that the term appears within the subset of documents, a number of documents within the subset of documents in which the term appears at least once, a word separation between the term and the identified potential query term within one or more of the subset of documents, and/or whether the term and identified potential query term appear within a same field in one or more of the subset of documents. The relatedness metric can be a normalized metric. For example, the metric can be normalized to constrain a top possible value, to constrain a sum of values across related indicator terms for a given potential query term, to account for a number of terms appearing within one or more documents, and/or to account for a number of documents in the subset of documents. The relatedness metric can be determined based on a weighted calculation. For example, an app associated with one document within the subset may have been downloaded by 50% of the users following entry of a search term, while an app associated with another document may have been downloaded by 10% of the users. Thus, terms in the first document may be weighted more heavily than terms in the second document.
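The download-weighted calculation described above can be sketched as follows. This is a minimal illustration only: the function name, the whitespace tokenization, the sample descriptions and the choice to use the download fraction directly as the weight are all assumptions, not details taken from the disclosure.

```python
from collections import defaultdict

def download_weighted_counts(subset):
    """Sum, for each word, the download fractions of the documents it appears in.

    `subset` is a list of (description_text, download_fraction) pairs; each
    occurrence of a word contributes the fraction of its document
    (an assumed weighting scheme).
    """
    weighted = defaultdict(float)
    for text, fraction in subset:
        for word in text.lower().split():
            weighted[word] += fraction
    return dict(weighted)

# Hypothetical subset: one app downloaded by 50% of users, another by 10%.
subset = [("compose email fast", 0.50), ("filter email", 0.10)]
weights = download_weighted_counts(subset)
print(weights)
```

Here a word found only in the more frequently downloaded document (“compose”) ends up with a larger weight than a word found only in the less frequently downloaded one (“filter”).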

As one particular example, a relatedness metric can be defined as a number of documents including both the identified potential query term and another term (e.g., a potential related indicator term) divided by a number of documents containing the identified potential query term and/or the other term. As another particular example, a relatedness metric can be defined as a number of documents including both the identified potential query term and another term divided by a number of documents containing the identified potential query term and/or the other term multiplied by an inverse-document-frequency term. The inverse-document-frequency can be defined as the log of the ratio of the number of documents in an analyzed set divided by the number of documents in the analyzed set including a term of interest.
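The two example metrics can be sketched as below. The sketch assumes that each document has already been reduced to a set of words, that the inverse-document-frequency is computed for the other term (the text does not say which term is the “term of interest”), and that the analyzed set is the same list of documents; the function names and sample documents are hypothetical.

```python
import math

def overlap_relatedness(docs, query_term, other_term):
    """Documents containing both terms divided by documents containing either."""
    both = sum(1 for d in docs if query_term in d and other_term in d)
    either = sum(1 for d in docs if query_term in d or other_term in d)
    return both / either if either else 0.0

def inverse_document_frequency(docs, term):
    """log(number of documents / number of documents containing the term)."""
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing) if containing else 0.0

def idf_weighted_relatedness(docs, query_term, other_term):
    """Overlap ratio scaled by the IDF of the other term (an assumed choice)."""
    return (overlap_relatedness(docs, query_term, other_term)
            * inverse_document_frequency(docs, other_term))

# Hypothetical documents, each represented as a set of words.
docs = [
    {"email", "compose", "send"},
    {"email", "filter"},
    {"email", "compose"},
    {"pool", "game"},
]
ratio = overlap_relatedness(docs, "email", "compose")        # 2 of 3 documents
scaled = idf_weighted_relatedness(docs, "email", "compose")  # ratio * log(4/2)
print(ratio, scaled)
```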

Identification of related indicator terms can include assessing a criterion or comparing a relatedness metric to one or more thresholds. For example, a criterion can specify that related indicator terms are those associated with a relatedness metric above a threshold, or that related indicator terms include terms associated with the, e.g., 10 highest relatedness metrics for a particular potential query term. Thus, in some instances, a binary characterization of related indicator terms can be provided: either a term is related to a potential query term or it is not. In some instances, the binary characterization is provided along with a more detailed assessment. For example, a list of 20 related indicator terms can be identified, and each can be associated with a weight or relatedness metric to indicate how closely the indicator term is related to the potential query term. In some instances, no binary characterization is provided. Rather, a weight or relatedness metric can indicate how closely the indicator term is related to the potential query term.
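The threshold and top-n criteria can be sketched as a single selection helper. The function name and the metric values other than 0.21 are hypothetical and chosen only for illustration.

```python
def select_related(metrics, threshold=None, top_n=None):
    """Keep terms whose metric exceeds `threshold` and/or the `top_n` highest.

    `metrics` maps a candidate indicator term to its relatedness metric for
    one potential query term. The returned dict keeps the metric values, so
    the binary decision comes with the more detailed assessment.
    """
    items = sorted(metrics.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        items = [(term, m) for term, m in items if m > threshold]
    if top_n is not None:
        items = items[:top_n]
    return dict(items)

# Hypothetical metrics for the potential query term "email".
metrics = {"compose": 0.21, "inbox": 0.15, "filter": 0.05, "pool": 0.01}
above = select_related(metrics, threshold=0.1)
best = select_related(metrics, top_n=1)
print(above, best)
```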

At block 410, term relationship structure 318 is constructed or updated to reflect the related indicator terms. For example, term relationship structure 318 can be updated to add (e.g., to a list or table) new related indicator terms associated with the potential query term, to remove (e.g., from a list or table) terms from being associated with the potential query term that are no longer identified as being related, to change or add relatedness metrics between the potential query term and another term, etc.
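One way to picture blocks 406-410 together is a sparse co-occurrence structure built over all potential query terms. The following is a sketch under stated assumptions: the subset for each term is taken to be the documents containing it, raw occurrence counts stand in for the relatedness metric, and the function name and sample documents are hypothetical.

```python
def build_cooccurrence(documents, terms):
    """For each term I, count occurrences of every other term J in documents containing I.

    Returns a nested dict {I: {J: count}} that stores only non-zero counts.
    """
    tokenized = [doc.lower().split() for doc in documents]
    structure = {}
    for i in terms:
        subset = [words for words in tokenized if i in words]
        row = {}
        for j in terms:
            if j == i:
                continue
            count = sum(words.count(j) for words in subset)
            if count:
                row[j] = count
        structure[i] = row
    return structure

# Hypothetical app descriptions and single-word potential query terms.
documents = ["compose email send email", "email filter", "compose music"]
terms = ["email", "compose", "filter", "music"]
structure = build_cooccurrence(documents, terms)
print(structure["email"])    # counts within documents that contain "email"
print(structure["compose"])  # counts within documents that contain "compose"
```

Because each row is computed over a different subset, the entry for (I, J) need not equal the entry for (J, I).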

Examples of term relationship structure 318 are shown in FIGS. 5A-5C. FIG. 5A shows an example in which term relationship structure 318 includes a list of words identified as being related to the potential query term of “Email”. Thus, in this instance, term relationship structure 318 amounts to a vector, with each cell including a related indicator term. In some instances, the vector could be expanded to a matrix, e.g., such that each row corresponds to a different potential query term. Different potential query terms can be associated with a same or different number of related indicator terms. For example, the number can vary if a criterion indicates that a relatedness metric must exceed an absolute threshold prior to a characterization of being related. In some instances, multiple term relationship structures 318 exist (e.g., one associated with each potential query term).

FIG. 5B shows an example in which term relationship structure 318 includes a list of words identified as being related to the potential query term of “Email” and relatedness metrics associated with each related indicator term. Thus, in this illustrative example, “Compose” is identified as having a stronger relationship with “Email” as compared to “Filter”.

FIG. 5C shows an example of a term relationship structure indicating a relatedness for multiple potential query terms. In this instance, a relatedness metric between each potential query term and each other potential query term is calculated and identified. Notably, the relatedness metrics are directional. For example, a potential search query term of “Email” has a relatedness metric of 0.21 relative to another term of “Compose”. Meanwhile, a potential search query term of “Compose” has a lower relatedness metric of 0.08 relative to another term of “Email” (e.g., because “compose” may instead relate to musical-composition concepts).

FIG. 6 is a flowchart illustrating a method 600 for using a term relationship structure to respond to a query. Method 600 can be used to augment a search using a particular query term, e.g., by identifying a row of a co-occurrence matrix to determine additional query terms and weighting factors.

At block 602, a query can be received. The query can be indicative of a search for an app and can include one or more words. The query can be received at a server from a client device, such as a computer or mobile phone.

At block 604, each query term in the query can be identified. In some instances, a query term is a single word. For example, in a query of “how to make a cake”, block 604 could include identifying five one-word terms or a subset of the five words (e.g., not including “a” and “to” as terms due to their non-substantive nature). In some instances, a query term includes a set of words, such as a bigram or trigram. The identified query terms can include overlapping words within the query. For example, terms identified for a query “speed read” could include both individual words and the pair of words.
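Query-term identification of this kind can be sketched as simple n-gram extraction. The stop-word list, the lowercase whitespace tokenization and the function name are assumptions made for illustration.

```python
# Assumed list of non-substantive words; a real list would be larger.
STOP_WORDS = {"a", "to"}

def identify_query_terms(query, max_words=2, stop_words=STOP_WORDS):
    """Return overlapping 1..max_words-word terms, skipping stop-word unigrams."""
    words = query.lower().split()
    terms = []
    for n in range(1, max_words + 1):
        for start in range(len(words) - n + 1):
            gram = words[start:start + n]
            if n == 1 and gram[0] in stop_words:
                continue
            terms.append(" ".join(gram))
    return terms

speed_terms = identify_query_terms("speed read")
cake_terms = identify_query_terms("how to make a cake", max_words=1)
print(speed_terms)  # both individual words and the pair of words
print(cake_terms)   # one-word terms without "a" and "to"
```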

Blocks 606-610 can be performed for each identified query term. At block 606, an identified query term can be used to access a row of a term relationship structure to identify related indicator terms. The related indicator terms can include those with relatedness metrics above a threshold or being non-zero. In some instances, the related indicator terms include n terms with the highest relatedness metrics.

At block 608, for each document within a set of documents, a number of times that each related indicator term and the query term appears can be counted. Thus, a query term with 10 related indicator terms could produce 11 counts. In some instances, separate counts are performed for separate sections of the document (e.g., a title versus a description).

At block 610, each count can be scaled by a weighting factor, e.g., to obtain a weighted count. The weighting factor can be determined based on the relatedness metric from the term relationship structure. Thus, the presence of an indicator term in a document that is highly related to a search-query term can be more heavily weighted than the presence of an indicator term with a weaker relation.

At block 612, a document score can be determined for each document based on the weighted count. The score can include a sum of the weighted counts associated with the document. For example, in the example of the “speed read” query, if 10 indicator terms are identified as being related to each of the three query terms, then 33 weighted counts can be summed for each document. The score can further account for other factors. For example, the score can normalize counts based on a number of terms in a document or a frequency of the indicator term across a set of documents.

At block 614, document rankings can be determined using the document scores. For example, documents associated with higher scores can be assigned a numerically lower ranking (e.g., being ranked closer to first) than documents associated with lower scores. The rankings can define an ordered list, which can then be provided to the user who initiated the query.
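Blocks 606-614 can be sketched end to end as follows. The sketch assumes single-word terms, whitespace tokenization, a weight of 1.0 for the query term itself and relatedness metrics used directly as weighting factors; the app names, descriptions and the 0.1 metric are hypothetical.

```python
def score_document(text, query_terms, structure):
    """Sum weighted counts of each query term and its related indicator terms."""
    words = text.lower().split()
    score = 0.0
    for query_term in query_terms:
        weights = {query_term: 1.0}                   # assumed weight for the query term
        weights.update(structure.get(query_term, {})) # row of the relationship structure
        for term, weight in weights.items():
            score += weight * words.count(term)       # count scaled by weighting factor
    return score

def rank_documents(documents, query_terms, structure):
    """Return document names ordered from highest score to lowest."""
    scores = {name: score_document(text, query_terms, structure)
              for name, text in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical structure row and app descriptions.
structure = {"email": {"compose": 0.21, "filter": 0.1}}
documents = {
    "Mail": "compose and send email filter email",
    "Pool": "play a pool game",
    "Notes": "compose notes",
}
ranking = rank_documents(documents, ["email"], structure)
print(ranking)
```

In this sketch a document that never mentions the query term can still receive a non-zero score through a related indicator term, which is the intended effect of augmenting the search.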

IV. Using Empirical User Feedback

A potential difficulty of term analysis is identifying the potential query terms and potential indicator terms in the first place. Especially when multi-word terms are to be identified in search queries and as related indicator terms, many possible terms can be considered. This number can result in long processing times to consider each term. Terms can be identified based on those used in documents, but given a large document set, the term set can still be very large. In some instances, terms (e.g., indicator terms) are identified based on terms in documents actually accessed by users. Thus, indicator terms important to users can be identified, while processing time can be constrained by ignoring infrequently used terms.

FIG. 7 is a flow diagram of a process 700 for determining indicator terms related to query terms according to an embodiment of the present invention. Process 700 can be implemented, e.g., in remote server 150 of FIG. 3. Process 700 can include a more detailed implementation of process 400 shown in FIG. 4. Blocks with similar or same text can be performed in manners similar to that described with respect to FIG. 4.

At block 702, the set of potential query terms can be identified using empirical queries (e.g., stored as empirical query data 316). The set of potential query terms can include terms entered by other users in past queries. In some instances, the set of potential query terms is identified by parsing queries into terms of particular word lengths (e.g., into one-word terms, into two-word terms, etc.). For example, each of the following queries could favor defining a potential query term as “email”: “Email”, “email friends”, “email app”, “email management” and “email celebrities”. In some instances, the set of potential query terms is defined based on exact matches compared to full queries. For example, in the above instance, only the “Email” query would favor defining the potential query term as “email”. Other term identification criteria can include identifying the most common query terms or identifying query terms with low success rates (e.g., such that users do not subsequently click on or download a provided search result).
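The difference between parsing queries into terms and requiring exact matches can be sketched with the five example queries. The function name, the case folding and the whitespace tokenization are assumptions.

```python
from collections import Counter

def count_candidate_terms(queries, word_length=1, exact=False):
    """Count how many times each term of `word_length` words is favored by the queries.

    With exact=False every query is parsed into overlapping terms; with
    exact=True only queries that consist of exactly one such term count.
    """
    counts = Counter()
    for query in queries:
        words = query.lower().split()
        if exact:
            if len(words) == word_length:
                counts[" ".join(words)] += 1
            continue
        for start in range(len(words) - word_length + 1):
            counts[" ".join(words[start:start + word_length])] += 1
    return counts

queries = ["Email", "email friends", "email app",
           "email management", "email celebrities"]
parsed = count_candidate_terms(queries)
exact = count_candidate_terms(queries, exact=True)
print(parsed["email"], exact["email"])
```

Sorting the resulting counts (e.g., with `Counter.most_common`) would give the most common query terms mentioned above.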

Blocks 704-708 can be performed for each identified potential query term. At block 704, a subset of documents can be identified based on empirical user query-responsive actions for a specific query term. The subset of documents can be used to identify indicator terms related to the specific query term. The subset of documents can be identified as some of the documents in a set of documents, the set of documents being associated with all digital files presently and/or previously available to users. For example, the subset of documents can include documents associated with digital files that users were likely to download or request additional information about following a query.

As a specific example, subsequent to entry of an “email” search query, empirical query data 316 can indicate that search results included a Mail app, a Yourmail app, a Mymail app and a pool-game app. Empirical query data 316 can further indicate that 15% of users downloaded the Mail app, 7% of users downloaded the Yourmail app, 4% of users downloaded the Mymail app, and 0% of users downloaded the pool-game app. The subset of documents can be defined to include the Mail, Yourmail and Mymail apps (e.g., by comparing the percentages to a threshold) and to exclude the pool-game app.
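The threshold comparison in this example can be sketched in a few lines. The 2% threshold and the function name are assumptions; the download fractions are those given in the example.

```python
def select_subset(download_rates, threshold):
    """Return the apps whose post-query download fraction exceeds the threshold."""
    return [app for app, rate in download_rates.items() if rate > threshold]

# Download fractions observed after an "email" query, per the example above.
download_rates = {"Mail": 0.15, "Yourmail": 0.07, "Mymail": 0.04, "pool-game": 0.0}
subset = select_subset(download_rates, threshold=0.02)  # threshold is hypothetical
print(subset)
```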

By using empirical feedback, the universe of related documents is narrowed. Moreover, as the documents have a high functional relevance, the words in the documents will more likely be relevant to the query term. In this manner, underlying functional relationships can be identified, as opposed to loose relationships obtained simply from using the word in a document describing an app. Further, the subset of documents indicates practical relevance, indicating a degree to which a document was of interest to a user in the app context. This practical relevance can be advantageous to identify over relationships identified from other contexts or founded on a theoretical basis.

At block 706, related indicator terms can be defined as those prevalent in the subset of documents. For example, the related indicator terms can include terms appearing in an above-threshold percentage of documents or appearing more than a threshold number of times within the documents.
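Prevalence within the subset can be sketched as a document-frequency filter. The function name, the whitespace tokenization, the sample descriptions and the 50% threshold are hypothetical.

```python
from collections import Counter

def prevalent_terms(subset_texts, min_fraction):
    """Return terms appearing in more than `min_fraction` of the subset documents."""
    document_frequency = Counter()
    for text in subset_texts:
        # Use a set so each document counts a term at most once.
        document_frequency.update(set(text.lower().split()))
    total = len(subset_texts)
    return {term: count / total
            for term, count in document_frequency.items()
            if count / total > min_fraction}

# Hypothetical descriptions of the apps in the subset.
subset_texts = ["send and compose mail", "compose mail fast", "filter mail"]
related = prevalent_terms(subset_texts, min_fraction=0.5)
print(related)
```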

At block 708, a term relationship structure can be constructed or updated to identify the related indicator terms. In some instances, block 708 includes adding the related indicator terms to a row, column or vector of the term relationship structure. In some instances, block 708 includes defining or updating a weighting factor associated with the term and the related indicator term.

FIG. 8 is a flow diagram of a process 800 for determining indicator terms related to query terms according to an embodiment of the present invention. Process 800 can be implemented, e.g., in remote server 150 of FIG. 3. Process 800 can include a more detailed implementation of process 400 shown in FIG. 4. Blocks with similar or same text can be performed in manners similar to that described with respect to FIG. 4 and/or FIG. 7.

At block 802, a set of potential query terms can be identified, e.g., as described with respect to block 502.

At block 804, for each potential query term, the subset of documents can be identified by performing a keyword lookup to retrieve documents containing the potential query term. The subset of documents can be identified as some of the documents in a set of documents, the set of documents being associated with all digital files presently and/or previously available to users or training documents. Block 804 can include, e.g., identifying a subset of documents within a set of documents associated with apps available to users on a previous day, the subset including documents including the potential query term at least once or at least a threshold number of times (e.g., generally or within one or more particular fields).

Blocks 806 and 808 then parallel blocks 706 and 708. However, the relatedness metric in this instance may also account for inter-term factors such as a number of words separating a potential query term and another potential indicator term. While, e.g., process 700 can include similar influences, such techniques can be less reliable in those instances due to a reduced assuredness that each document within the subset of documents will include the potential query term.

FIG. 9 is a flow diagram of a process 900 for performing a search for a digital file according to an embodiment of the present invention. Process 900 can be implemented, e.g., in remote server 150 of FIG. 3.

At block 902, a search query is received from a user device 105. The search query can include and/or consist of one or more query terms.

At block 904, the one or more query terms can be used to look up associated related indicator terms in a term relationship structure. As a result of block 904, one or more related indicator terms can be identified, each of which in some instances can be associated with a relatedness metric or weight (e.g., the weight depending on the relatedness metric).

As an example, a query term (e.g., a unigram or bigram) can be used to retrieve a row/column or other set of data where indicator terms related to the search term are stored. In one implementation, the term can be hashed to provide a direct memory address for retrieving the related indicator terms. In another implementation, the hash can be used to access an index table for obtaining the memory address. The size of data to be retrieved can also be stored in the index table.
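The index-table variant can be sketched in memory as follows. This is only an analogy: a Python list stands in for addressable storage, a list offset stands in for the memory address, and SHA-1 from the standard `hashlib` module is an arbitrary choice of hash. The function names and the 0.1 metric are hypothetical.

```python
import hashlib

def term_key(term):
    """Hash a query term to a stable key (SHA-1 chosen only for illustration)."""
    return hashlib.sha1(term.lower().encode("utf-8")).hexdigest()

def build_store(structure):
    """Flatten rows into one list and record (offset, size) per hashed term."""
    storage, index_table = [], {}
    for term, related in structure.items():
        index_table[term_key(term)] = (len(storage), len(related))
        storage.extend(related.items())
    return storage, index_table

def lookup(storage, index_table, term):
    """Use the hash to find the offset and size, then read that slice."""
    offset, size = index_table.get(term_key(term), (0, 0))
    return dict(storage[offset:offset + size])

structure = {"email": {"compose": 0.21, "filter": 0.1},
             "compose": {"email": 0.08}}
storage, index_table = build_store(structure)
row = lookup(storage, index_table, "Email")
print(row)
```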

At block 906, a search is performed. In one instance, an initial search can be performed using the one or more terms in the search query. Results of the initial search can then be refined (e.g., pruned, ranked or re-ranked) using related indicator terms identified at block 904. In another instance, a single search (e.g., and ranking) is performed using the one or more terms in the search query and the related indicator terms identified at block 904. The related indicator terms can be weighted, such that a document including a highly weighted related indicator term can be preferred over a document including a less highly weighted related indicator term.

At block 908, the search results can be transmitted to user device 105. The search results can include an ordered list of digital files, a group of documents associated with digital files, or data from a group of documents associated with digital files. In some instances, the related indicator terms are also transmitted to user device 105.

User device 105 can then present the search results to the user, e.g., as a linear list or grid-style list. The search results can be presented in a manner that allows the user to interact with one or more results. For example, a user may be able to select one search result, and additional information about the selected search result can be presented (e.g., on a new page, a pop-up window or on a current page). As another example, a user may be able to request a download of a search-result app. In some instances, downloading an app (e.g., an app that is not free) can require purchasing the app. Data about these additional interactions can be transmitted back to remote server 150, e.g., such that empirical query data 316 can be updated.

Some disclosed embodiments illustrate a multi-step process: first, a term relationship structure is created to identify terms related to potential search-query terms; and second, a search is performed using the related indicator terms. It will be understood that the disclosures can be extended to embodiments in which these processes are performed nearly simultaneously. For example, upon receiving a search-query term, a term relationship structure can be constructed or updated to identify related indicator terms in real-time, and a search can then be performed using the related indicator terms.

Embodiments described herein can efficiently and effectively respond to users' search queries. Textual-based searches can have multiple limitations. First, textual-based searches tied to exact terms in users' queries can overlook potential search results that relate to a user's concept of interest but that do not include the exact terms. Users are forced to express their search queries in a finite number of words, and thus, any given query can (intentionally or accidentally) omit terms that would produce search results still of interest to the user. Second, textual-based searches can return search results that are not of interest to the user but happen to include search-query terms within associated documents. For example, incidental terms in descriptions can give rise to identification of the documents as pertaining to the search.

Meanwhile, embodiments herein can extend searches beyond the precise vocabulary used within a query. Related indicator terms can be identified based on terms appearing within documents estimated to be of most actual interest to users (e.g., by assessing their click-through and downloading activity) or based on terms that frequently co-occur with search-query terms. This process can be performed offline, such that related indicator terms can be immediately identified upon receipt of a query. Search results can be identified, refined and/or ranked in accordance with the related indicator terms. Therefore, search results can be more likely to match users' expectations.

Many disclosures herein are tied to search queries for apps. It will be appreciated that the disclosures can be extended to search queries pertaining to other digital files (e.g., content files), such as audio files (e.g., music files or podcast files) and/or video files.

A number of processes are disclosed herein. The processes can be performed in part or in their entireties by a computer, processor, electronic circuit, etc. Thus, each process block can, in some embodiments, be performed by, e.g., a computer.

Portions of the description can refer to particular user interfaces, such as touchscreen displays. Other embodiments can use different interfaces. For example, a user interface can be voice-based, with the user speaking instructions into a microphone or other audio input device and the device providing an audible response (e.g., using synthesized speech or pre-recorded audio clips). A combination of voice-based and visual interface elements can be used, and in some embodiments, multiple different types of interfaces can be supported, with the user having the option to select a desired interface, to use multiple interfaces in combination (e.g., reading information from the screen and speaking instructions) and/or to switch between different interfaces. Any desired form of user interaction with a device can be supported.

Embodiments of the present invention can be realized using any combination of dedicated components and/or programmable processors and/or other programmable devices. The various processes described herein can be implemented on the same processor or different processors in any combination. Accordingly, where components are described as being configured to perform certain operations, such configuration can be accomplished, e.g., by designing electronic circuits to perform the operation, by programming programmable electronic circuits (such as microprocessors) to perform the operation, or any combination thereof. Processes can communicate using a variety of techniques including but not limited to conventional techniques for interprocess communication, and different pairs of processes can use different techniques, or the same pair of processes can use different techniques at different times. Further, while the embodiments described above can make reference to specific hardware and software components, those skilled in the art will appreciate that different combinations of hardware and/or software components can also be used and that particular operations described as being implemented in hardware might also be implemented in software or vice versa.

Computer programs incorporating various features of the present invention can be encoded and stored on various computer readable storage media; suitable media include magnetic disk or tape, optical storage media such as compact disk (CD) or DVD (digital versatile disk), flash memory, and other non-transitory media. Computer readable media encoded with the program code can be packaged with a compatible electronic device, or the program code can be provided separately from electronic devices (e.g., via Internet download or as a separately packaged computer-readable storage medium).

Thus, although the invention has been described with respect to specific embodiments, it will be appreciated that the invention is intended to cover all modifications and equivalents within the scope of the following claims.

What is claimed is:
 1. A method of defining a data structure to use for expanding a search query requesting relevant software applications from a database storing software applications and a set of documents, each document being associated with a software application, the method comprising: identifying, with a server, a set of N potential query terms by analyzing words in the set of documents, N being an integer greater than 100; defining a co-occurrence matrix defined by an N×N array of matrix elements, wherein each row corresponds to a respective one of the N potential query terms and each column corresponds to a respective one of the N potential query terms; for each potential query term I: identifying, with the server, a subset of documents within the set of documents in which the potential query term I exists; for each of the N−1 other potential query terms: analyzing, with the server, the subset of documents to identify a respective number of times the respective other potential query term J occurs in the subset of documents; and calculating, with the server, a co-occurrence score corresponding to matrix element (I,J) based on the respective number; and constructing, with the server, the co-occurrence matrix using the co-occurrence scores.
 2. The method of claim 1, further comprising: receiving a query term; identifying the row of the co-occurrence matrix corresponding to the query term; retrieving at least non-zero co-occurrence scores stored in the identified row of the co-occurrence matrix.
 3. The method of claim 1, wherein constructing the co-occurrence matrix using the co-occurrence scores includes: storing only the co-occurrence scores greater than a threshold.
 4. The method of claim 1, wherein analyzing words includes: identifying which words correspond to nouns and verbs; and using the nouns and verbs as the set of potential query terms.
 5. A non-transitory machine readable medium storing executable instructions which when executed by a system cause the system to perform a method of defining a data structure to use for expanding a search query requesting relevant software applications from a database storing software applications and a set of documents, each document being associated with a software application, the method comprising: identifying, with a server, a set of N potential query terms by analyzing words in the set of documents, N being an integer greater than 100; defining a co-occurrence matrix defined by an N×N array of matrix elements, wherein each row corresponds to a respective one of the N potential query terms and each column corresponds to a respective one of the N potential query terms; for each potential query term I: identifying, with the server, a subset of documents within the set of documents in which the potential query term I exists; for each of the N−1 other potential query terms: analyzing, with the server, the subset of documents to identify a respective number of times the respective other potential query term J occurs in the subset of documents; and calculating, with the server, a co-occurrence score corresponding to matrix element (I,J) based on the respective number; and constructing, with the server, the co-occurrence matrix using the co-occurrence scores.
 6. The medium of claim 5, further comprising: receiving a query term; identifying the row of the co-occurrence matrix corresponding to the query term; retrieving at least non-zero co-occurrence scores stored in the identified row of the co-occurrence matrix.
 7. The medium of claim 5, wherein constructing the co-occurrence matrix using the co-occurrence scores includes: storing only the co-occurrence scores greater than a threshold.
 8. The medium of claim 5, wherein analyzing words includes: identifying which words correspond to nouns and verbs; and using the nouns and verbs as the set of potential query terms.